load("~/Documents/Nonparametric Statisics/Project/clean data/functional/df_week.RData")

First, we start with outlier detection in the functional setting.

fd_week <- fData(1:7,as_tibble(df_week[,3:9]))
plot(fd_week,lwd = 3,xlab = "day",ylab = "daily number of crashes",main = "daily crashes in each week")

Functional bagplot:

week_fbagplot <- fbplot(fd_week, main="Magnitude outliers weekly data")

The default F value is:

week_fbagplot$Fvalue
[1] 1.5

No outliers are found.
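The default F = 1.5 plays the same role as the whisker factor in a classical boxplot: the central 50% region is inflated by F, and curves beyond the resulting fence are flagged as magnitude outliers. A univariate base-R analogue on toy data (an illustration of the rule, not the functional computation itself):

```r
# Univariate analogue of the F = 1.5 inflation used by the functional bagplot:
# the interquartile box is inflated by F, exactly like boxplot whiskers.
set.seed(1)
x <- c(rnorm(100), 10)                  # 100 regular values plus one clear outlier
q <- unname(quantile(x, c(0.25, 0.75)))
fence <- c(q[1] - 1.5 * diff(q), q[2] + 1.5 * diff(q))
which(x < fence[1] | x > fence[2])      # the value 10 (index 101) falls outside
```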

We need to remove the weeks that cause problems with zeros (weeks 1, 52, and 53):

df_week2 <- df_week %>% filter(week_of_year %in% 2:51)
fd_week2 <- fData(1:7,as_tibble(df_week2[,3:9]))
week_fbagplot2 <- fbplot(fd_week2, main="Magnitude outliers week data",
                                  adjust = list( N_trials = 20,trial_size = fd_week2$N,
                                                 VERBOSE = TRUE ))
 * * * Iterations 1 - 20 / 20: optimization finished (repetitive verbose log condensed).

The chosen F value is:

week_fbagplot2$Fvalue
[1] 0.6115131

The outlying years are:

df_week2[week_fbagplot2$ID_outliers,1:2]

Mostly the first years and the COVID year.

Outliergram:

invisible(out_week <- outliergram(fd_week,adjust = FALSE,lwd = 3,display = FALSE))

The detected outliers are:

df_week[out_week$ID_outliers,1:2]

The plot produced by the original function is not working.

Adjusting the F value:

out_week <- outliergram(fd_week,lwd = 3,adjust = list( N_trials = 20,trial_size = 8*fd_week$N,
                                                 VERBOSE = TRUE ),display = FALSE)
 * * * Iterations 1 - 20 / 20 (repetitive verbose optimization log condensed).
out_week$Fvalue
[1] 2.373441

Nothing changed: the same outliers are detected.

df_week[out_week$ID_outliers,1:2]

Plotting them manually:

par(mfrow=c(1,2))
plot(fd_week[out_week$ID_outliers,],lwd = 1,main = "outliers",col = 2)
plot(fd_week[-out_week$ID_outliers,],lwd = 1,main = "non outliers",col = 3)

Trying again after removing the weeks with the zeros:

week2_fbagplot <- fbplot(fd_week2, main="Magnitude outliers weekly data")

The default F value is:

week2_fbagplot$Fvalue
[1] 1.5

No outliers are found.

df_week2[week2_fbagplot$ID_outliers,1:2]
week_fbagplot2 <- fbplot(fd_week2, main="Magnitude outliers week data",
                                  adjust = list( N_trials = 20,trial_size = fd_week2$N,
                                                 VERBOSE = TRUE ))
 * * * Iterations 1 - 20 / 20: optimization finished (repetitive verbose log condensed).

The chosen F value is:

week_fbagplot2$Fvalue
[1] 0.631645

The outlying years are:

df_week2[week_fbagplot2$ID_outliers,1:2]

Mostly the first years and the COVID year.

Outliergram:

invisible(out_week2 <- outliergram(fd_week2,adjust = FALSE,lwd = 3,display = FALSE))

The detected outliers are:

df_week2[out_week2$ID_outliers,1:2]

The plot produced by the original function is not working.

Adjusting the F value:

out_week2 <- outliergram(fd_week2,lwd = 3,adjust = list( N_trials = 20,trial_size = 8*fd_week2$N,
                                                 VERBOSE = TRUE ),display = FALSE)
 * * * Iterations 1 - 20 / 20 (repetitive verbose optimization log condensed).
out_week2$Fvalue
[1] 2.353695

Nothing changed: the same outliers are detected.

df_week2[out_week2$ID_outliers,1:2]

Plotting them manually:

par(mfrow=c(1,2))
plot(fd_week2[out_week2$ID_outliers,],lwd = 1,main = "outliers",col = 2)
plot(fd_week2[-out_week2$ID_outliers,],lwd = 1,main = "non outliers",col = 3)

We can do some clustering; there is no need to align the data this time:
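As a quick sanity check of the input layout, here is a base-R toy sketch of the n x P abscissa matrix that fdacluster's fdakmeans expects (one row per curve; the same construction used below with the real data):

```r
# The evaluation grid goes in as an n x P matrix, one row per curve;
# with a common grid every row is just the daily abscissa 1:7.
days <- 1:7
n_toy <- 5                                      # toy curve count (real data: 943)
x_toy <- t(matrix(rep(days, n_toy), 7, n_toy))
dim(x_toy)                                      # 5 x 7
all(x_toy == matrix(days, n_toy, 7, byrow = TRUE))  # every row equals 1:7
```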

days <- 1:7
n <- fd_week$N
x <- t(matrix(rep(days,n),7,n))
y <- as.matrix(df_week[,3:9])
k <- 3
system.time(
fdakma0der <- fdakmeans(x = x,y = y, n_clusters = k,
  seeds = sample(1:n,k),
  warping_class = "none",
  metric = "pearson",
  centroid_type = "mean",
  distance_relative_tolerance = 1e-3,
  add_silhouettes = FALSE,
  parallel_method = 0L,
  number_of_threads = 12
)
)
Information about the data set:
 - Number of observations: 943
 - Number of dimensions: 1
 - Number of points: 7

Information about cluster initialization:
 - Number of clusters: 3
 - Initial seeds for cluster centers:            15          379          723

Information about the methods used within the algorithm:
 - Warping method: none
 - Center method: mean
 - Dissimilarity method: pearson
 - Optimization method: bobyqa

Information about warping parameter bounds:
 - Warping options:    0.1500   0.1500

Information about convergence criteria:
 - Maximum number of iterations: 100
 - Distance relative tolerance: 0.001

Information about parallelization setup:
 - Number of threads: 12
 - Parallel method: 0

Other information:
 - Use fence to robustify: 0
 - Check total dissimilarity: 1
 - Compute overall center: 0

Running k-centroid algorithm:
 - Iteration #1
   * Size of cluster #0: 417
   * Size of cluster #1: 312
   * Size of cluster #2: 214
 - Iteration #2
   * Size of cluster #0: 385
   * Size of cluster #1: 293
   * Size of cluster #2: 265
 - Iteration #3
   * Size of cluster #0: 369
   * Size of cluster #1: 292
   * Size of cluster #2: 282
 - Iteration #4
   * Size of cluster #0: 361
   * Size of cluster #1: 282
   * Size of cluster #2: 300
 - Iteration #5
   * Size of cluster #0: 351
   * Size of cluster #1: 273
   * Size of cluster #2: 319
 - Iteration #6
   * Size of cluster #0: 341
   * Size of cluster #1: 272
   * Size of cluster #2: 330
 - Iteration #7
   * Size of cluster #0: 336
   * Size of cluster #1: 273
   * Size of cluster #2: 334
 - Iteration #8
   * Size of cluster #0: 337
   * Size of cluster #1: 268
   * Size of cluster #2: 338
 - Iteration #9
   * Size of cluster #0: 334
   * Size of cluster #1: 269
   * Size of cluster #2: 340
 - Iteration #10
   * Size of cluster #0: 335
   * Size of cluster #1: 269
   * Size of cluster #2: 339
 - Iteration #11
   * Size of cluster #0: 331
   * Size of cluster #1: 270
   * Size of cluster #2: 342
 - Iteration #12
   * Size of cluster #0: 328
   * Size of cluster #1: 268
   * Size of cluster #2: 347
 - Iteration #13
   * Size of cluster #0: 327
   * Size of cluster #1: 268
   * Size of cluster #2: 348
 - Iteration #14
   * Size of cluster #0: 326
   * Size of cluster #1: 268
   * Size of cluster #2: 349
 - Iteration #15
   * Size of cluster #0: 326
   * Size of cluster #1: 268
   * Size of cluster #2: 349
 - Iteration #16
   * Size of cluster #0: 325
   * Size of cluster #1: 266
   * Size of cluster #2: 352

Active stopping criteria:
 - The total dissimilarity did not decrease.
   user  system elapsed 
  2.482   0.130   2.584 
autoplot(fdakma0der,type = "amplitude")

Selecting the number of clusters:

n_sub <- 50
sub_id <- sample(1:n,n_sub,replace = FALSE)
x_sub <- x[sub_id,]
y_sub <- y[sub_id,]

system.time(invisible(comparison_kmeans <- compare_caps(
  x_sub,
  y_sub,
  n_clusters = 2:5,
  metric = "pearson",
  clustering_method = "kmeans",
  warping_class = "none",
  centroid_type = "mean",
  cluster_on_phase = FALSE
    )
  )
)
plot(comparison_kmeans, validation_criterion = "wss", what = "mean",lwd = 3)

plot(comparison_kmeans, validation_criterion = "wss", what = "distribution")

plot(comparison_kmeans, validation_criterion = "silhouette", what = "mean")

plot(comparison_kmeans, validation_criterion = "silhouette", what = "distribution")
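The wss criterion read off these plots is the familiar elbow heuristic; a hedged analogue with ordinary stats::kmeans on toy bivariate data (not the functional pipeline, just the selection logic):

```r
# Elbow heuristic on toy data with two well-separated groups:
# wss keeps decreasing with k, but the large drop has happened by the true k = 2.
set.seed(42)
toy <- rbind(matrix(rnorm(100, mean = 0), 50, 2),
             matrix(rnorm(100, mean = 4), 50, 2))
wss <- sapply(2:5, function(k) kmeans(toy, k, nstart = 10)$tot.withinss)
names(wss) <- 2:5
round(wss, 1)
```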

Two clusters is probably the best choice; let's take a look:

k <- 2
system.time(
fdakma0der <- fdakmeans(x = x,y = y, n_clusters = k,
  seeds = sample(1:n,k),
  warping_class = "none",
  metric = "pearson",
  centroid_type = "mean",
  distance_relative_tolerance = 1e-3,
  add_silhouettes = FALSE,
  parallel_method = 0L,
  number_of_threads = 12
)
)
Information about the data set:
 - Number of observations: 943
 - Number of dimensions: 1
 - Number of points: 7

Information about cluster initialization:
 - Number of clusters: 2
 - Initial seeds for cluster centers:           508          111

Information about the methods used within the algorithm:
 - Warping method: none
 - Center method: mean
 - Dissimilarity method: pearson
 - Optimization method: bobyqa

Information about warping parameter bounds:
 - Warping options:    0.1500   0.1500

Information about convergence criteria:
 - Maximum number of iterations: 100
 - Distance relative tolerance: 0.001

Information about parallelization setup:
 - Number of threads: 12
 - Parallel method: 0

Other information:
 - Use fence to robustify: 0
 - Check total dissimilarity: 1
 - Compute overall center: 0

Running k-centroid algorithm:
 - Iteration #1
   * Size of cluster #0: 863
   * Size of cluster #1: 80
 - Iteration #2
   * Size of cluster #0: 798
   * Size of cluster #1: 145
 - Iteration #3
   * Size of cluster #0: 720
   * Size of cluster #1: 223
 - Iteration #4
   * Size of cluster #0: 663
   * Size of cluster #1: 280
 - Iteration #5
   * Size of cluster #0: 613
   * Size of cluster #1: 330
 - Iteration #6
   * Size of cluster #0: 574
   * Size of cluster #1: 369
 - Iteration #7
   * Size of cluster #0: 547
   * Size of cluster #1: 396
 - Iteration #8
   * Size of cluster #0: 533
   * Size of cluster #1: 410
 - Iteration #9
   * Size of cluster #0: 521
   * Size of cluster #1: 422
 - Iteration #10
   * Size of cluster #0: 510
   * Size of cluster #1: 433
 - Iteration #11
   * Size of cluster #0: 503
   * Size of cluster #1: 440
 - Iteration #12
   * Size of cluster #0: 496
   * Size of cluster #1: 447
 - Iteration #13
   * Size of cluster #0: 491
   * Size of cluster #1: 452
 - Iteration #14
   * Size of cluster #0: 490
   * Size of cluster #1: 453
 - Iteration #15
   * Size of cluster #0: 490
   * Size of cluster #1: 453

Active stopping criteria:
 - Memberships did not change.
   user  system elapsed 
  1.790   0.093   1.865 
autoplot(fdakma0der,type = "amplitude")

table(df_week[fdakma0der$memberships==1,2])
week_of_year
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 
10 10  4  9  7  6  5 10 10  4  7  9  9 11 13 12 12 13 13  7 10 14 10 10 11 10 12  9  5 11 12 13 
33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 
10 14 14 12 13 10 10 11  7  5 11 13  6  9  5  9  7  4  1  7  4 
table(df_week[fdakma0der$memberships==2,2])
week_of_year
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 
 8  8 14  9 11 12 13  8  8 14 11  9  9  7  5  6  6  5  5 11  8  4  8  8  7  8  6  9 13  7  6  5 
33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 
 8  4  4  6  5  8  8  7 11 13  7  5 12  9 13  9 11 14 17 11  3 
id1 <- which(fdakma0der$memberships==1)
id2 <- which(fdakma0der$memberships==2)
par(mfrow=c(1,2))
plot(fd_week[id1[1:30],],lwd = 1,main = "cluster 1",col = 2)
plot(fd_week[id2[1:30],],lwd = 1,main = "cluster 2",col = 3)

This is probably a split between holiday and non-holiday weeks.
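A hedged way to check the holiday interpretation, sketched here on toy vectors (holiday_weeks is an assumption; with the real data one would cross-tabulate df_week$week_of_year against fdakma0der$memberships):

```r
# Cross-tabulate cluster membership against a holiday indicator (toy data).
holiday_weeks <- c(31:34, 51:53)             # assumed holiday weeks (summer + Christmas)
week   <- c(2, 10, 20, 32, 33, 52, 45, 51)   # toy week_of_year values
member <- c(2, 2, 2, 1, 1, 1, 2, 1)          # toy cluster labels
table(cluster = member, holiday = week %in% holiday_weeks)
```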

n2 <- fd_week2$N
x2 <- t(matrix(rep(days,n2),7,n2))
y2 <- as.matrix(df_week2[,3:9])
k <- 2
system.time(
fdakma0der <- fdakmeans(x = x2,y = y2, n_clusters = k,
  seeds = sample(1:n2,k),
  warping_class = "none",
  metric = "pearson",
  centroid_type = "mean",
  distance_relative_tolerance = 1e-3,
  add_silhouettes = FALSE,
  parallel_method = 0L,
  number_of_threads = 12
)
)
Information about the data set:
 - Number of observations: 900
 - Number of dimensions: 1
 - Number of points: 7

Information about cluster initialization:
 - Number of clusters: 2
 - Initial seeds for cluster centers:           227          688

Information about the methods used within the algorithm:
 - Warping method: none
 - Center method: mean
 - Dissimilarity method: pearson
 - Optimization method: bobyqa

Information about warping parameter bounds:
 - Warping options:    0.1500   0.1500

Information about convergence criteria:
 - Maximum number of iterations: 100
 - Distance relative tolerance: 0.001

Information about parallelization setup:
 - Number of threads: 12
 - Parallel method: 0

Other information:
 - Use fence to robustify: 0
 - Check total dissimilarity: 1
 - Compute overall center: 0

Running k-centroid algorithm:
 - Iteration #1
   * Size of cluster #0: 103
   * Size of cluster #1: 797
 - Iteration #2
   * Size of cluster #0: 195
   * Size of cluster #1: 705
 - Iteration #3
   * Size of cluster #0: 270
   * Size of cluster #1: 630
 - Iteration #4
   * Size of cluster #0: 329
   * Size of cluster #1: 571
 - Iteration #5
   * Size of cluster #0: 362
   * Size of cluster #1: 538
 - Iteration #6
   * Size of cluster #0: 383
   * Size of cluster #1: 517
 - Iteration #7
   * Size of cluster #0: 401
   * Size of cluster #1: 499
 - Iteration #8
   * Size of cluster #0: 411
   * Size of cluster #1: 489
 - Iteration #9
   * Size of cluster #0: 414
   * Size of cluster #1: 486
 - Iteration #10
   * Size of cluster #0: 417
   * Size of cluster #1: 483
 - Iteration #11
   * Size of cluster #0: 420
   * Size of cluster #1: 480
 - Iteration #12
   * Size of cluster #0: 425
   * Size of cluster #1: 475
 - Iteration #13
   * Size of cluster #0: 427
   * Size of cluster #1: 473
 - Iteration #14
   * Size of cluster #0: 429
   * Size of cluster #1: 471
 - Iteration #15
   * Size of cluster #0: 430
   * Size of cluster #1: 470
 - Iteration #16
   * Size of cluster #0: 430
   * Size of cluster #1: 470

Active stopping criteria:
 - Memberships did not change.
   user  system elapsed 
  1.706   0.084   1.768 
autoplot(fdakma0der,type = "amplitude")

table(df_week2[fdakma0der$memberships==1,2])
week_of_year
 2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 
 8 14  9 11 12 13  8  8 14 11  9  9  7  5  6  5  5  5 11  8  4  8  8  7  8  6  9 13  7  6  5  8 
34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 
 4  4  6  5  9  8  7 11 13  7  4 12  9 13  9 11 14 17 
table(df_week2[fdakma0der$memberships==2,2])
week_of_year
 2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 
10  4  9  7  6  5 10 10  4  7  9  9 11 13 12 13 13 13  7 10 14 10 10 11 10 12  9  5 11 12 13 10 
34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 
14 14 12 13  9 10 11  7  5 11 14  6  9  5  9  7  4  1 
id1 <- which(fdakma0der$memberships==1)
id2 <- which(fdakma0der$memberships==2)
par(mfrow=c(1,2))
plot(fd_week2[id1[1:30],],lwd = 1,main = "cluster 1",col = 2)
plot(fd_week2[id2[1:30],],lwd = 1,main = "cluster 2",col = 3)

Excluding the problematic weeks makes no difference.

k <- 3
system.time(
fdakma0der <- fdakmeans(x = x2,y = y2, n_clusters = k,
  seeds = sample(1:n2,k),
  warping_class = "none",
  metric = "pearson",
  centroid_type = "mean",
  distance_relative_tolerance = 1e-3,
  add_silhouettes = FALSE,
  parallel_method = 0L,
  number_of_threads = 12
)
)
Information about the data set:
 - Number of observations: 900
 - Number of dimensions: 1
 - Number of points: 7

Information about cluster initialization:
 - Number of clusters: 3
 - Initial seeds for cluster centers:           592          357          277

Information about the methods used within the algorithm:
 - Warping method: none
 - Center method: mean
 - Dissimilarity method: pearson
 - Optimization method: bobyqa

Information about warping parameter bounds:
 - Warping options:    0.1500   0.1500

Information about convergence criteria:
 - Maximum number of iterations: 100
 - Distance relative tolerance: 0.001

Information about parallelization setup:
 - Number of threads: 12
 - Parallel method: 0

Other information:
 - Use fence to robustify: 0
 - Check total dissimilarity: 1
 - Compute overall center: 0

Running k-centroid algorithm:
 - Iteration #1
   * Size of cluster #0: 383
   * Size of cluster #1: 156
   * Size of cluster #2: 361
 - Iteration #2
   * Size of cluster #0: 353
   * Size of cluster #1: 185
   * Size of cluster #2: 362
 - Iteration #3
   * Size of cluster #0: 336
   * Size of cluster #1: 202
   * Size of cluster #2: 362
 - Iteration #4
   * Size of cluster #0: 318
   * Size of cluster #1: 219
   * Size of cluster #2: 363
 - Iteration #5
   * Size of cluster #0: 306
   * Size of cluster #1: 230
   * Size of cluster #2: 364
 - Iteration #6
   * Size of cluster #0: 293
   * Size of cluster #1: 241
   * Size of cluster #2: 366
 - Iteration #7
   * Size of cluster #0: 291
   * Size of cluster #1: 247
   * Size of cluster #2: 362
 - Iteration #8
   * Size of cluster #0: 291
   * Size of cluster #1: 249
   * Size of cluster #2: 360
 - Iteration #9
   * Size of cluster #0: 289
   * Size of cluster #1: 251
   * Size of cluster #2: 360
 - Iteration #10
   * Size of cluster #0: 287
   * Size of cluster #1: 252
   * Size of cluster #2: 361
 - Iteration #11
   * Size of cluster #0: 284
   * Size of cluster #1: 253
   * Size of cluster #2: 363
 - Iteration #12
   * Size of cluster #0: 282
   * Size of cluster #1: 256
   * Size of cluster #2: 362
 - Iteration #13
   * Size of cluster #0: 279
   * Size of cluster #1: 260
   * Size of cluster #2: 361

Active stopping criteria:
 - The total dissimilarity did not decrease.
   user  system elapsed 
  1.863   0.111   1.956 
autoplot(fdakma0der,type = "amplitude")

table(df_week2[fdakma0der$memberships==1,2])
week_of_year
 2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 
 3  3  7  6  2  3  6  6  4  5  5  5  6  9  4  7  5  5  5  8  6  3  5  7  7  7  8  4 12  9 11  6 
34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 
10  7  6  5  9  7  7  4  2  8  9  4  4  3  3  4  1 
table(df_week2[fdakma0der$memberships==2,2])
week_of_year
 2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 
10  2  2  4  8  4  5  5  1  3  7  5  9  5  9  6  9  9  3  5  8  7  5  4  5  5  2  2  3  4  3  5 
34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 
 4  9  7  8  2  4  6  4  6  4  6  4  7  4  7  5  4  1 
table(df_week2[fdakma0der$memberships==3,2])
week_of_year
 2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 
 5 13  9  8  8 11  7  7 13 10  6  8  3  4  5  5  4  4 10  5  4  8  8  7  6  6  8 12  3  5  4  7 
34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 
 4  2  5  5  7  7  5 10 10  6  3 10  7 11  8  9 13 17 
id1 <- which(fdakma0der$memberships==1)
id2 <- which(fdakma0der$memberships==2)
id3 <- which(fdakma0der$memberships==3)
par(mfrow=c(1,3))
plot(fd_week2[id1[1:30],],lwd = 1,main = "cluster 1",col = 2)
plot(fd_week2[id2[1:30],],lwd = 1,main = "cluster 2",col = 3)
plot(fd_week2[id3[1:30],],lwd = 1,main = "cluster 3",col = 4)

Clusters 2 and 3 are very similar; trying with 4 clusters:

k <- 4
system.time(
fdakma0der <- fdakmeans(x = x2,y = y2, n_clusters = k,
  seeds = sample(1:n2,k),
  warping_class = "none",
  metric = "pearson",
  centroid_type = "mean",
  distance_relative_tolerance = 1e-3,
  add_silhouettes = FALSE,
  parallel_method = 0L,
  number_of_threads = 12
)
)
Information about the data set:
 - Number of observations: 900
 - Number of dimensions: 1
 - Number of points: 7

Information about cluster initialization:
 - Number of clusters: 4
 - Initial seeds for cluster centers:           869          778          319          413

Information about the methods used within the algorithm:
 - Warping method: none
 - Center method: mean
 - Dissimilarity method: pearson
 - Optimization method: bobyqa

Information about warping parameter bounds:
 - Warping options:    0.1500   0.1500

Information about convergence criteria:
 - Maximum number of iterations: 100
 - Distance relative tolerance: 0.001

Information about parallelization setup:
 - Number of threads: 12
 - Parallel method: 0

Other information:
 - Use fence to robustify: 0
 - Check total dissimilarity: 1
 - Compute overall center: 0

Running k-centroid algorithm:
 - Iteration #1
   * Size of cluster #0: 485
   * Size of cluster #1: 42
   * Size of cluster #2: 354
   * Size of cluster #3: 19
 - Iteration #2
   * Size of cluster #0: 433
   * Size of cluster #1: 109
   * Size of cluster #2: 305
   * Size of cluster #3: 53
 - Iteration #3
   * Size of cluster #0: 387
   * Size of cluster #1: 161
   * Size of cluster #2: 264
   * Size of cluster #3: 88
 - Iteration #4
   * Size of cluster #0: 348
   * Size of cluster #1: 193
   * Size of cluster #2: 243
   * Size of cluster #3: 116
 - Iteration #5
   * Size of cluster #0: 324
   * Size of cluster #1: 213
   * Size of cluster #2: 227
   * Size of cluster #3: 136
 - Iteration #6
   * Size of cluster #0: 309
   * Size of cluster #1: 224
   * Size of cluster #2: 216
   * Size of cluster #3: 151
 - Iteration #7
   * Size of cluster #0: 295
   * Size of cluster #1: 231
   * Size of cluster #2: 208
   * Size of cluster #3: 166
 - Iteration #8
   * Size of cluster #0: 285
   * Size of cluster #1: 234
   * Size of cluster #2: 208
   * Size of cluster #3: 173
 - Iteration #9
   * Size of cluster #0: 279
   * Size of cluster #1: 234
   * Size of cluster #2: 207
   * Size of cluster #3: 180
 - Iteration #10
   * Size of cluster #0: 275
   * Size of cluster #1: 236
   * Size of cluster #2: 204
   * Size of cluster #3: 185
 - Iteration #11
   * Size of cluster #0: 275
   * Size of cluster #1: 240
   * Size of cluster #2: 201
   * Size of cluster #3: 184
 - Iteration #12
   * Size of cluster #0: 273
   * Size of cluster #1: 241
   * Size of cluster #2: 200
   * Size of cluster #3: 186
 - Iteration #13
   * Size of cluster #0: 273
   * Size of cluster #1: 240
   * Size of cluster #2: 200
   * Size of cluster #3: 187
 - Iteration #14
   * Size of cluster #0: 272
   * Size of cluster #1: 239
   * Size of cluster #2: 200
   * Size of cluster #3: 189
 - Iteration #15
   * Size of cluster #0: 271
   * Size of cluster #1: 237
   * Size of cluster #2: 200
   * Size of cluster #3: 192
 - Iteration #16
   * Size of cluster #0: 271
   * Size of cluster #1: 236
   * Size of cluster #2: 199
   * Size of cluster #3: 194
 - Iteration #17
   * Size of cluster #0: 272
   * Size of cluster #1: 237
   * Size of cluster #2: 199
   * Size of cluster #3: 192
 - Iteration #18
   * Size of cluster #0: 273
   * Size of cluster #1: 238
   * Size of cluster #2: 198
   * Size of cluster #3: 191
 - Iteration #19
   * Size of cluster #0: 273
   * Size of cluster #1: 239
   * Size of cluster #2: 198
   * Size of cluster #3: 190

Active stopping criteria:
 - The total dissimilarity did not decrease.
   user  system elapsed 
  3.143   0.170   3.291 
autoplot(fdakma0der,type = "amplitude")

table(df_week2[fdakma0der$memberships==1,2])
week_of_year
 2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 
 3  8  5  6  6 10  3  6 13  9  4  7  4  3  3  4  5  3  7  6  3  7  6  5  6  5  5  9  3  5  2  5 
34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 
 4  2  4  4  7  6  2  6  5  4  1  8  6  8  6  5  9 10 
table(df_week2[fdakma0der$memberships==2,2])
week_of_year
 2  3  4  5  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 
 3  3  5  6  2  6  5  3  4  3  5  6 10  5  7  5  4  5  6  6  2  5  6  6  6  5  4  8  8  9  6 10 
35 36 37 38 39 40 41 43 44 45 46 47 48 49 50 
10  5  5  4  7  6  3  7  6  2  2  1  3  2  1 
table(df_week2[fdakma0der$memberships==3,2])
week_of_year
 2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 23 24 25 26 27 28 29 30 31 32 33 34 
 9  2  5  3  8  3  7  3  2  2  7  1  2  2  2  3  3  3  1  5  4  5  5  5  3  4  3  2  2  2  4  3 
36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 
 5  4  3  4  7  1  7  4  8  4  6  5  6  7  5  7 
table(df_week2[fdakma0der$memberships==4,2])
week_of_year
 2  3  4  5  6  7  8  9 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 
 3  5  3  3  4  3  2  4  3  4  5  6  3  8  4  5  8  5  1  9  5  2  2  1  4  4  2  5  3  5  3  1 
35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 
 6  4  5  4  1  3  8  6  3  3  4  4  4  3  4  3  1 
id1 <- which(fdakma0der$memberships==1)
id2 <- which(fdakma0der$memberships==2)
id3 <- which(fdakma0der$memberships==3)
id4 <- which(fdakma0der$memberships==4)
par(mfrow=c(2,2))
plot(fd_week2[id1[1:30],],lwd = 1,main = "cluster 1",col = 2)
plot(fd_week2[id2[1:30],],lwd = 1,main = "cluster 2",col = 3)
plot(fd_week2[id3[1:30],],lwd = 1,main = "cluster 3",col = 4)
plot(fd_week2[id4[1:30],],lwd = 1,main = "cluster 4",col = 5)

Two clusters still works best.

system.time(
  hclustres <- fdahclust(
  x = x2,
  y = y2,
  n_clusters = 2L,
  warping_class = "none",
  centroid_type = "mean",
  metric = "pearson",
  linkage_criterion = "complete",
  cluster_on_phase = FALSE,
  use_verbose = TRUE,
  warping_options = c(0.15, 0.15),
  maximum_number_of_iterations = 100L,
  number_of_threads = 12L,
  parallel_method = 0L,
  distance_relative_tolerance = 0.001,
  use_fence = FALSE,
  check_total_dissimilarity = TRUE,
  compute_overall_center = FALSE
)
)
ℹ Computing the distance matrix...
autoplot(hclustres)

diagnostic_plot(hclustres)

table(hclustres$memberships)

  1   2 
820  80 
matplot(t(hclustres$center_curves[,1,]),type = 'l',
        main='cluster centroid curves',xlab='days',ylab='crashes', col = 1:5,lwd = 3)

The pattern is always the same: the division into the two clusters.
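The hierarchical step can be mimicked in base R with stats::hclust (complete linkage) on a plain distance matrix; a toy sketch of the mechanism, not the fdahclust pearson-metric computation:

```r
# Complete-linkage clustering of toy weekly curves, cut at k = 2 as above.
set.seed(3)
toy <- rbind(matrix(rnorm(70, mean = 0), 10, 7),   # 10 low-level "weeks"
             matrix(rnorm(70, mean = 5), 10, 7))   # 10 high-level "weeks"
hc <- hclust(dist(toy), method = "complete")
table(cutree(hc, k = 2))                           # recovers the two groups
```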

k <- 2
system.time(
fdakma0der <- fdakmeans(x = x,y = y, n_clusters = k,
  seeds = sample(1:n,k),
  warping_class = "none",
  metric = "pearson",
  centroid_type = "mean",
  distance_relative_tolerance = 1e-3,
  add_silhouettes = FALSE,
  parallel_method = 0L,
  number_of_threads = 12
)
)
Information about the data set:
 - Number of observations: 943
 - Number of dimensions: 1
 - Number of points: 7

Information about cluster initialization:
 - Number of clusters: 2
 - Initial seeds for cluster centers:           462          379

Information about the methods used within the algorithm:
 - Warping method: none
 - Center method: mean
 - Dissimilarity method: pearson
 - Optimization method: bobyqa

Information about warping parameter bounds:
 - Warping options:    0.1500   0.1500

Information about convergence criteria:
 - Maximum number of iterations: 100
 - Distance relative tolerance: 0.001

Information about parallelization setup:
 - Number of threads: 12
 - Parallel method: 0

Other information:
 - Use fence to robustify: 0
 - Check total dissimilarity: 1
 - Compute overall center: 0

Running k-centroid algorithm:
 - Iteration #1
   * Size of cluster #0: 452
   * Size of cluster #1: 491
 - Iteration #2
   * Size of cluster #0: 450
   * Size of cluster #1: 493
 - Iteration #3
   * Size of cluster #0: 447
   * Size of cluster #1: 496
 - Iteration #4
   * Size of cluster #0: 442
   * Size of cluster #1: 501
 - Iteration #5
   * Size of cluster #0: 442
   * Size of cluster #1: 501
 - Iteration #6
   * Size of cluster #0: 457
   * Size of cluster #1: 486
 - Iteration #7
   * Size of cluster #0: 464
   * Size of cluster #1: 479
 - Iteration #8
   * Size of cluster #0: 480
   * Size of cluster #1: 463
 - Iteration #9
   * Size of cluster #0: 481
   * Size of cluster #1: 462
 - Iteration #10
   * Size of cluster #0: 478
   * Size of cluster #1: 465
 - Iteration #11
   * Size of cluster #0: 484
   * Size of cluster #1: 459
 - Iteration #12
   * Size of cluster #0: 493
   * Size of cluster #1: 450
 - Iteration #13
   * Size of cluster #0: 495
   * Size of cluster #1: 448
 - Iteration #14
   * Size of cluster #0: 491
   * Size of cluster #1: 452
 - Iteration #15
   * Size of cluster #0: 487
   * Size of cluster #1: 456
 - Iteration #16
   * Size of cluster #0: 487
   * Size of cluster #1: 456
 - Iteration #17
   * Size of cluster #0: 488
   * Size of cluster #1: 455
 - Iteration #18
   * Size of cluster #0: 488
   * Size of cluster #1: 455

Active stopping criteria:
 - Memberships did not change.
   user  system elapsed 
  1.754   0.098   1.848 
autoplot(fdakma0der,type = "amplitude")
table(df_week[fdakma0der$memberships==1,2])
week_of_year
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 
10 10  4  9  7  6  5 10 10  4  7  9  9 11 13 12 12 13 13  7 10 14 10 10 11 10 12  9  5 11 12 13 
33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 
10 14 14 12 13  9 10 11  7  5 11 13  6  9  5  9  7  4  1  6  4 
table(df_week[fdakma0der$memberships==2,2])
week_of_year
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 
 8  8 14  9 11 12 13  8  8 14 11  9  9  7  5  6  6  5  5 11  8  4  8  8  7  8  6  9 13  7  6  5 
33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 
 8  4  4  6  5  9  8  7 11 13  7  5 12  9 13  9 11 14 17 12  3 
id1 <- which(fdakma0der$memberships==1)
id2 <- which(fdakma0der$memberships==2)
par(mfrow=c(1,2))
plot(fd_week[id1[1:30],],lwd = 1,main = "cluster 1",col = 2)
plot(fd_week[id2[1:30],],lwd = 1,main = "cluster 2",col = 3)

plotting the weeks differently:

df <- data.frame(table(df_week[fdakma0der$memberships==1,2]),table(df_week[fdakma0der$memberships==2,2])) %>% select(-week_of_year.1)
df %>% ggplot(aes(order(week_of_year),Freq)) + geom_line()

this shows that group 1 are the work week, while cluster 2 are the holiays.

---
title: "Nonparametric Analysis of UK Road Accidents"
subtitle: "Functional clustering weekly data"
author:
    - "Valeria Iapaolo"
    - "Oswaldo Morales"
    - "Riccardo Morandi"
    - "Abylai Orynbassar"
    
output: html_notebook
---

```{r setup, echo = FALSE}
knitr::opts_chunk$set(
    echo = TRUE,
    #dev = c('pdf'),
    fig.align = 'center',
    #fig.path = 'output/',
    fig.height = 6,
    fig.width = 12
)
```

```{r libraries inclusions, include=FALSE}
library(tidyverse)
library(roahd)
#library(fda)
library(fdacluster)
```

```{r}
load("~/Documents/Nonparametric Statisics/Project/clean data/functional/df_week.RData")
```

First, we start with outlier detection in the functional setting.

```{r}
fd_week <- fData(1:7,as_tibble(df_week[,3:9]))
```

```{r}
plot(fd_week,lwd = 3,xlab = "day",ylab = "daily number of crashes",main = "daily crashes in each week")
```

Functional boxplot:

```{r}
week_fbagplot <- fbplot(fd_week, main="Magnitude outliers weekly data")
```

The default F value is:

```{r}
week_fbagplot$Fvalue
```


No outliers are found:

```{r}
df_week[week_fbagplot$ID_outliers,1:2]
```
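
As a reminder of what F controls: the fences of the functional boxplot are obtained by inflating the central 50% region by the factor F, and curves crossing the inflated envelope are flagged as magnitude outliers. A minimal sketch on simulated data (not the crash data; the `generate_gauss_fdata` parameters are arbitrary):

```{r}
# Sketch of the F value mechanism on simulated Gaussian functional data.
library(roahd)
set.seed(1)
t_grid <- seq(0, 1, length.out = 7)
D <- generate_gauss_fdata(N = 50, centerline = sin(2 * pi * t_grid),
                          Cov = exp_cov_function(t_grid, alpha = 0.2, beta = 0.4))
D[1, ] <- D[1, ] + 3   # inject a clear magnitude outlier
bp <- fbplot(fData(t_grid, D), Fvalue = 1.5, display = FALSE)
bp$ID_outliers         # curve 1 is expected to be flagged
```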


We need to remove the weeks that cause problems with zeros (weeks 1, 52 and 53):

```{r}
df_week2 <- df_week %>% filter(week_of_year %in% 2:51)
fd_week2 <- fData(1:7,as_tibble(df_week2[,3:9]))
```

```{r}
week_fbagplot2 <- fbplot(fd_week2, main="Magnitude outliers week data",
                                  adjust = list( N_trials = 20,trial_size = fd_week2$N,
                                                 VERBOSE = TRUE ))
```

The chosen F value is:

```{r}
week_fbagplot2$Fvalue
```

The outlying weeks are:

```{r}
df_week2[week_fbagplot2$ID_outliers,1:2]
```

These belong to the first years and to the COVID year.


Outliergram:

```{r}
invisible(out_week <- outliergram(fd_week, adjust = FALSE, lwd = 3, display = FALSE))
```
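
The outliergram targets shape outliers rather than magnitude ones: it plots the modified band depth (MBD) of each curve against its modified epigraph index (MEI), and curves falling well below the limiting parabola are flagged. A minimal sketch on simulated data (illustrative only; parameters are arbitrary):

```{r}
# Sketch of shape outlier detection: curve 1 is replaced by an anti-phase
# curve, which has typical magnitude but atypical shape.
library(roahd)
set.seed(2)
t_grid <- seq(0, 1, length.out = 7)
D <- generate_gauss_fdata(N = 50, centerline = sin(2 * pi * t_grid),
                          Cov = exp_cov_function(t_grid, alpha = 0.2, beta = 0.4))
D[1, ] <- -sin(2 * pi * t_grid)   # anti-phase shape outlier
og <- outliergram(fData(t_grid, D), display = FALSE)
og$ID_outliers
```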

The detected outliers are:

```{r}
df_week[out_week$ID_outliers,1:2]
```

The plot of the original functions is not working.

Adjusting the F value:

```{r}
out_week <- outliergram(fd_week,lwd = 3,adjust = list( N_trials = 20,trial_size = 8*fd_week$N,
                                                 VERBOSE = TRUE ),display = FALSE)
```

```{r}
out_week$Fvalue
```

Nothing changed; the same outliers are detected.

```{r}
df_week[out_week$ID_outliers,1:2]
```

Plotting in the old way:

```{r}
par(mfrow=c(1,2))
plot(fd_week[out_week$ID_outliers,],lwd = 1,main = "outliers",col = 2)
plot(fd_week[-out_week$ID_outliers,],lwd = 1,main = "non outliers",col = 3)
```

Trying to remove the weeks with the zeros:

```{r}
week2_fbagplot <- fbplot(fd_week2, main="Magnitude outliers weekly data")
```

The default F value is:

```{r}
week2_fbagplot$Fvalue
```


No outliers are found:

```{r}
df_week2[week2_fbagplot$ID_outliers,1:2]
```

```{r}
week_fbagplot2 <- fbplot(fd_week2, main="Magnitude outliers week data",
                                  adjust = list( N_trials = 20,trial_size = fd_week2$N,
                                                 VERBOSE = TRUE ))
```

The chosen F value is:

```{r}
week_fbagplot2$Fvalue
```

The outlying weeks are:

```{r}
df_week2[week_fbagplot2$ID_outliers,1:2]
```

These belong to the first years and to the COVID year.


Outliergram:

```{r}
invisible(out_week2 <- outliergram(fd_week2, adjust = FALSE, lwd = 3, display = FALSE))
```

The detected outliers are:

```{r}
df_week2[out_week2$ID_outliers,1:2]
```

The plot of the original functions is not working.

Adjusting the F value:

```{r}
out_week2 <- outliergram(fd_week2,lwd = 3,adjust = list( N_trials = 20,trial_size = 8*fd_week2$N,
                                                 VERBOSE = TRUE ),display = FALSE)
```

```{r}
out_week2$Fvalue
```

Nothing changed; the same outliers are detected.

```{r}
df_week2[out_week2$ID_outliers,1:2]
```

Plotting in the old way:

```{r}
par(mfrow=c(1,2))
plot(fd_week2[out_week2$ID_outliers,],lwd = 1,main = "outliers",col = 2)
plot(fd_week2[-out_week2$ID_outliers,],lwd = 1,main = "non outliers",col = 3)
```

We can now do some clustering; there is no need to align the data this time:

```{r}
days <- 1:7
n <- fd_week$N
x <- t(matrix(rep(days,n),7,n))
y <- as.matrix(df_week[,3:9])
```

```{r}
k <- 3
system.time(
fdakma0der <- fdakmeans(x = x,y = y, n_clusters = k,
  seeds = sample(1:n,k),
  warping_class = "none",
  metric = "pearson",
  centroid_type = "mean",
  distance_relative_tolerance = 1e-3,
  add_silhouettes = F,
  parallel_method = 0L,
  number_of_threads = 12
)
)
```

```{r}
autoplot(fdakma0der,type = "amplitude")
```

Selecting the number of clusters:

```{r}
n_sub <- 50
sub_id <- sample(1:n,n_sub,replace = FALSE)
x_sub <- x[sub_id,]
y_sub <- y[sub_id,]

system.time(invisible(comparison_kmeans <- compare_caps(
  x_sub,
  y_sub,
  n_clusters = 2:5,
  metric = "pearson",
  clustering_method = "kmeans",
  warping_class = "none",
  centroid_type = "mean",
  cluster_on_phase = FALSE
    )
  )
)
```

```{r}
plot(comparison_kmeans, validation_criterion = "wss", what = "mean",lwd = 3)
```

```{r}
plot(comparison_kmeans, validation_criterion = "wss", what = "distribution")
```

```{r}
plot(comparison_kmeans, validation_criterion = "silhouette", what = "mean")
```

```{r}
plot(comparison_kmeans, validation_criterion = "silhouette", what = "distribution")
```


Two clusters is probably the best choice; let's take a look:

```{r}
k <- 2
system.time(
fdakma0der <- fdakmeans(x = x,y = y, n_clusters = k,
  seeds = sample(1:n,k),
  warping_class = "none",
  metric = "pearson",
  centroid_type = "mean",
  distance_relative_tolerance = 1e-3,
  add_silhouettes = F,
  parallel_method = 0L,
  number_of_threads = 12
)
)
```

```{r}
autoplot(fdakma0der,type = "amplitude")
```

```{r}
table(df_week[fdakma0der$memberships==1,2])
```

```{r}
table(df_week[fdakma0der$memberships==2,2])
```

```{r}
id1 <- which(fdakma0der$memberships==1)
id2 <- which(fdakma0der$memberships==2)
par(mfrow=c(1,2))
plot(fd_week[id1[1:30],],lwd = 1,main = "cluster 1",col = 2)
plot(fd_week[id2[1:30],],lwd = 1,main = "cluster 2",col = 3)
```

This is probably holiday weeks vs. working weeks.
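
One way to sanity-check this interpretation is to cross-tabulate the memberships with a set of candidate holiday weeks (a sketch: the week numbers below are an assumption, e.g. the weeks around Christmas, not something derived from the data):

```{r}
# Hypothetical check: do the assumed holiday weeks concentrate in one cluster?
holiday_weeks <- c(1, 52, 53)   # assumed holiday weeks (illustrative guess)
is_holiday <- df_week$week_of_year %in% holiday_weeks
table(cluster = fdakma0der$memberships, holiday = is_holiday)
```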

```{r}
n2 <- fd_week2$N
x2 <- t(matrix(rep(days,n2),7,n2))
y2 <- as.matrix(df_week2[,3:9])
```

```{r}
k <- 2
system.time(
fdakma0der <- fdakmeans(x = x2,y = y2, n_clusters = k,
  seeds = sample(1:n2,k),
  warping_class = "none",
  metric = "pearson",
  centroid_type = "mean",
  distance_relative_tolerance = 1e-3,
  add_silhouettes = F,
  parallel_method = 0L,
  number_of_threads = 12
)
)
```

```{r}
autoplot(fdakma0der,type = "amplitude")
```

```{r}
table(df_week2[fdakma0der$memberships==1,2])
```

```{r}
table(df_week2[fdakma0der$memberships==2,2])
```

```{r}
id1 <- which(fdakma0der$memberships==1)
id2 <- which(fdakma0der$memberships==2)
par(mfrow=c(1,2))
plot(fd_week2[id1[1:30],],lwd = 1,main = "cluster 1",col = 2)
plot(fd_week2[id2[1:30],],lwd = 1,main = "cluster 2",col = 3)
```

Excluding the problematic weeks makes no difference.

```{r}
k <- 3
system.time(
fdakma0der <- fdakmeans(x = x2,y = y2, n_clusters = k,
  seeds = sample(1:n2,k),
  warping_class = "none",
  metric = "pearson",
  centroid_type = "mean",
  distance_relative_tolerance = 1e-3,
  add_silhouettes = F,
  parallel_method = 0L,
  number_of_threads = 12
)
)
```

```{r}
autoplot(fdakma0der,type = "amplitude")
```

```{r}
table(df_week2[fdakma0der$memberships==1,2])
```

```{r}
table(df_week2[fdakma0der$memberships==2,2])
```

```{r}
table(df_week2[fdakma0der$memberships==3,2])
```


```{r}
id1 <- which(fdakma0der$memberships==1)
id2 <- which(fdakma0der$memberships==2)
id3 <- which(fdakma0der$memberships==3)
par(mfrow=c(1,3))
plot(fd_week2[id1[1:30],],lwd = 1,main = "cluster 1",col = 2)
plot(fd_week2[id2[1:30],],lwd = 1,main = "cluster 2",col = 3)
plot(fd_week2[id3[1:30],],lwd = 1,main = "cluster 3",col = 4)
```

Clusters 2 and 3 are very similar.

Trying with 4 clusters:

```{r}
k <- 4
system.time(
fdakma0der <- fdakmeans(x = x2,y = y2, n_clusters = k,
  seeds = sample(1:n2,k),
  warping_class = "none",
  metric = "pearson",
  centroid_type = "mean",
  distance_relative_tolerance = 1e-3,
  add_silhouettes = F,
  parallel_method = 0L,
  number_of_threads = 12
)
)
```

```{r}
autoplot(fdakma0der,type = "amplitude")
```

```{r}
table(df_week2[fdakma0der$memberships==1,2])
```

```{r}
table(df_week2[fdakma0der$memberships==2,2])
```

```{r}
table(df_week2[fdakma0der$memberships==3,2])
```

```{r}
table(df_week2[fdakma0der$memberships==4,2])
```

```{r}
id1 <- which(fdakma0der$memberships==1)
id2 <- which(fdakma0der$memberships==2)
id3 <- which(fdakma0der$memberships==3)
id4 <- which(fdakma0der$memberships==4)
par(mfrow=c(2,2))
plot(fd_week2[id1[1:30],],lwd = 1,main = "cluster 1",col = 2)
plot(fd_week2[id2[1:30],],lwd = 1,main = "cluster 2",col = 3)
plot(fd_week2[id3[1:30],],lwd = 1,main = "cluster 3",col = 4)
plot(fd_week2[id4[1:30],],lwd = 1,main = "cluster 4",col = 5)
```


Still better with 2 clusters.

```{r}
system.time(
  hclustres <- fdahclust(
  x = x2,
  y = y2,
  n_clusters = 2L,
  warping_class = "none",
  centroid_type = "mean",
  metric = "pearson",
  linkage_criterion = "complete",
  cluster_on_phase = FALSE,
  use_verbose = TRUE,
  warping_options = c(0.15, 0.15),
  maximum_number_of_iterations = 100L,
  number_of_threads = 12L,
  parallel_method = 0L,
  distance_relative_tolerance = 0.001,
  use_fence = FALSE,
  check_total_dissimilarity = TRUE,
  compute_overall_center = FALSE
)
)
```

```{r}
autoplot(hclustres)
```

```{r}
diagnostic_plot(hclustres)
```

```{r}
table(hclustres$memberships)
```

```{r}
matplot(t(hclustres$center_curves[,1,]),type = 'l',
        main = 'cluster center curves',xlab = 'day',ylab = 'crashes', col = 1:2,lwd = 3)
```

The pattern is always the same: the data split into the same two clusters.
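
To check that hierarchical clustering and k-means really agree, one can refit k-means with k = 2 on the same filtered data and cross-tabulate the two membership vectors (a sketch; the cluster labels may be switched between the two methods):

```{r}
# A strongly diagonal (or anti-diagonal, by label switching) table means the
# two methods recover essentially the same partition.
km2 <- fdakmeans(x = x2, y = y2, n_clusters = 2,
                 seeds = sample(1:n2, 2),
                 warping_class = "none", metric = "pearson",
                 centroid_type = "mean")
table(kmeans = km2$memberships, hclust = hclustres$memberships)
```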



```{r}
k <- 2
system.time(
fdakma0der <- fdakmeans(x = x,y = y, n_clusters = k,
  seeds = sample(1:n,k),
  warping_class = "none",
  metric = "pearson",
  centroid_type = "mean",
  distance_relative_tolerance = 1e-3,
  add_silhouettes = F,
  parallel_method = 0L,
  number_of_threads = 12
)
)
```

```{r}
autoplot(fdakma0der,type = "amplitude")
```

```{r}
table(df_week[fdakma0der$memberships==1,2])
```

```{r}
table(df_week[fdakma0der$memberships==2,2])
```

```{r}
id1 <- which(fdakma0der$memberships==1)
id2 <- which(fdakma0der$memberships==2)
par(mfrow=c(1,2))
plot(fd_week[id1[1:30],],lwd = 1,main = "cluster 1",col = 2)
plot(fd_week[id2[1:30],],lwd = 1,main = "cluster 2",col = 3)
```

Plotting the weeks differently:

```{r}
df <- data.frame(table(df_week[fdakma0der$memberships==1,2]),
                 table(df_week[fdakma0der$memberships==2,2])) %>% select(-week_of_year.1)
df %>% ggplot(aes(as.integer(week_of_year))) +
  geom_line(aes(y = Freq, colour = "cluster 1")) +
  geom_line(aes(y = Freq.1, colour = "cluster 2")) +
  labs(x = "week of year", y = "number of weeks assigned", colour = NULL)
```

This shows that cluster 1 corresponds to the working weeks, while cluster 2 corresponds to the holiday weeks.